Abstract

The multi-armed bandit is a concise model for the problem of iterated decision-making under
uncertainty. In each round, a gambler must pull one of K arms of a slot machine, without any
foreknowledge of their payouts, except that they are uniformly bounded. A standard objective is
to minimize the gambler's regret, defined as the largest total payout which would have been
achieved by any fixed arm, in hindsight, minus the gambler's total payout. Note that the gambler
is only told the payout for the arm actually chosen, not for the unchosen arms.

Almost all previous work on this problem assumed the payouts to be non-adaptive, in the
sense that the distribution of the payout of arm j in round i is completely independent of the
choices made by the gambler on rounds $1, \dots, i-1$. In the more general model of adaptive
payouts, the payouts in round i may depend arbitrarily on the history of past choices made by
the algorithm.

We present a new algorithm for this problem, and prove nearly optimal guarantees for the
regret against both non-adaptive and adaptive adversaries. After T rounds, our algorithm has
regret $O(\sqrt{T})$ with high probability (the tail probability decays exponentially). This dependence
on T is best possible, and matches that of the full-information version of the problem, in which
the gambler is told the payouts for all K arms after each round.

Previously, even for non-adaptive payouts, the best high-probability bounds known were
$O(T^{2/3})$, due to Auer, Cesa-Bianchi, Freund and Schapire [2]. For non-adaptive payouts, they
also proved an $O(\sqrt{T})$ bound on expected regret. We describe an adaptive payout scheme for
which the expected regret of their algorithm is $\Omega(T^{2/3})$.

1 Introduction

In problems of "online decision making," a sequence of choices must be made without knowledge 
of the future. Typically, each decision results in a certain "cost" or "reward," which is immediately 
revealed to the algorithm for use in later decision-making. 

In this setting, a common goal is to minimize the "regret" of the algorithm, defined as the 
algorithm's net cost minus the least cost, which would, in hindsight, have been incurred by a different 
decision sequence. This is clearly a hopeless task unless the decision sequences the algorithm will 
compare itself against are restricted somehow. Often this is achieved by only considering strategies 
that make a single decision and use it every round.

Imagine a slot machine with K arms. A gambler plays the slot machine repeatedly, each time
by pulling one of the arms and paying its cost for that round. At the end of the day, the gambler 
wants his total cost to be not much more than that of the best single arm in hindsight. 

We will focus on two different models of feedback to the gambler. In the full information setting, 
the costs of all K arms are revealed to the gambler after each round. In the bandit setting, only 
the cost of the chosen arm is revealed, making life more difficult for the gambler. 

1.1 Adaptive and non-adaptive cost allocation 

Much of the early work in this area (see, e.g., [12]) was based on the assumption that each
machine has a time-invariant distribution, from which its cost is sampled independently in each 
round. In that case, the gambler's problem can be viewed as "learning" these distributions, while 
avoiding the unfavorable machines. 

Subsequently, many papers have removed the assumption that the cost distribution is fixed, 
instead allowing it to vary arbitrarily over time. However, most of these results still assume that 
the way the costs vary is completely unaffected by the choices made by the gambler. We will refer 
to this model as non-adaptive cost allocation. Note that non-adaptive cost allocations are not
required to be deterministic, and hence include time-invariant cost allocation as a special case. 

In both of the above settings, essentially everything is known about the expected regret. The 
best possible expected regret in the full information version is known to be $\Theta(\sqrt{T \log K})$ [110 IE]
and in the bandit version, $O(\sqrt{TK \log K})$ [TJ H] and $\Omega(\sqrt{TK})$. The lower bounds hold even for
time-invariant cost distributions, and the upper bounds hold for any non-adaptive cost allocation. 

It is also natural to consider, for a parameter $\epsilon > 0$, what are the best regret bounds which can be
guaranteed with probability at least $1 - \epsilon$. For this stronger type of guarantee, the answer does not
change much for the full information version; the right answer is $O(\sqrt{T(\log K + \log 1/\epsilon)})$. However,
for the bandit version of the problem, the best previously known high probability bounds were of
the form $O(T^{2/3} (K \log K)^{1/3})$, even for $\epsilon = 1/3$. We will improve these to $O(\sqrt{TK \log K}\, \log 1/\epsilon)$.

In the more general setting of adaptive cost allocation, the costs are additionally allowed to 
depend on the decisions made by the gambler in previous rounds. The only independence
requirement is that the gambler is allowed to randomize his current decision independently of the costs
being set for the current round. 

The best known lower bounds in the adaptive costs framework are the same (at least up to 
constant factors) as in the non-adaptive framework. In the full-information setting, adaptive payouts
cannot force any more expected regret than non-adaptive payouts can (see [3, Theorem 3.1]). In
that setting, the optimal expected regret is $\Theta(\sqrt{T \log K})$.

In the bandit version however, adaptive payouts are strictly more powerful than non-adaptive. 
This can be seen by a brute force calculation of the optimal expected regret for the tiny example 
K = 2, T = 2. Here the worst expected regret that can be forced by adaptive costs is 2/3, whereas 
non-adaptive costs can only force expected regret 1/2. The present paper addresses the question, 
"How much more regret can an adaptive cost sequence force?" As we shall demonstrate, the answer 
is, "At most a constant factor." 

In the bandit setting, the best previously known upper bound was $O(T^{2/3} (K \log K)^{1/3})$, due
to Auer, Cesa-Bianchi, Freund and Schapire [2]. Our main contribution is a new algorithm for the
bandit model which improves this upper bound to $O(\sqrt{TK \log K})$. This algorithm, which we call
"Accounts," is stated in Section 3.

Theorem 1.1. Let R denote the regret of the "Accounts" algorithm for the K-armed bandit, on
any adaptively chosen cost sequence of length T. Then, for every $a \ge 1$,

\[ \Pr\left( R > (a + 7)\sqrt{TK \ln K} \right) \le 1000\, K \sqrt{a}\, \exp\left( -\frac{\sqrt{a}\, \ln K}{8} \right). \]

It follows that

\[ E(R) = O(\sqrt{TK \ln K}). \]

None of the constants in Theorem 1.1 is tight. In particular, the $\sqrt{a}$ controlling the rate of
exponential decay can be replaced by $a^{1-\epsilon}$, for any $\epsilon > 0$, at some expense to the other parameters.
Our emphasis in this paper is on the main new theoretical ideas. 

It is easy to see that every algorithm for the K-armed bandit problem has expected regret
$\Omega(\sqrt{TK})$ for a time-invariant cost allocation scheme in which one of the machines pays 1 with
probability $1/2 + \sqrt{K/T}$ and 0 otherwise, and the other K - 1 machines all pay 1 and 0, each
with probability 1/2. Roughly speaking, the random noise from the coin flips is enough to hide the
identity of the best machine for about T/K observations; meanwhile, because we are in the bandit
model, there are only enough rounds to observe each machine about T/K times. This shows that
our upper bound on expected regret is within an $O(\sqrt{\log K})$ factor of being optimal.

We also show that the regret bound on the Exp3 algorithm of Auer et al. cannot be
substantially improved for adaptive costs, resolving an open question from that paper.

Theorem 1.2. For any $K \ge 2$ and parameters $\eta = \eta(T)$, $\gamma = \gamma(T)$, there exists an adaptive cost
schedule such that

\[ E(R) = \Omega(T^{2/3}), \]

where R denotes the regret for the Exp3 algorithm with "exploration probability" $\gamma$ and "sensitivity"
$\eta$.

We prove Theorem 1.2 in a later section.

Remark 1.3. Auer et al. [2] incorrectly asserted that Exp3 has expected regret $O(\sqrt{T})$ for adaptive
costs. What is true is that, if

\[ R_j := \sum_{t=1}^{T} \left( x^t \cdot c^t - e_j \cdot c^t \right) \]

denotes the "regret against arm j," then

\[ \max_j E(R_j) = O(\sqrt{T}) \]

holds for adaptive cost schedules. This expression equals the expected regret in the case of
non-adaptive costs (where the index j achieving the maximum is deterministic). However, in general
the inequality

\[ E(R) = E\left( \max_j R_j \right) \ge \max_j E(R_j) \]

may be very far from equality, as observed by, among others, McMahan and Blum [11, Appendix B].

1.2 The "Accounts" algorithm

Our algorithm makes use of a "multiplicative update rule," selecting each arm randomly with 
probabilities that evolve based on their past performance. This principle is well-established in the 
literature; indeed, our algorithm can be seen as a direct descendant of the Exp3 algorithm of
Auer et al. [2]. That algorithm can in turn be viewed as the "natural" modification of the Hedge
algorithm of Freund and Schapire [I] to the bandit setting. The Hedge algorithm is itself closely
related to several earlier algorithms (see, e.g., Littlestone and Warmuth [S] and Vovk [13]). Indeed,
as recently observed by Kalai, the Hedge algorithm can be seen as a special case of the "Follow
the Perturbed Leader" algorithm (where the components of the perturbation are logarithms of 
exponentially distributed random variables). 

The first step in modifying a full-information algorithm for use in the bandit setting is to devise 
a way to accurately guess the missing information. By "missing information" we do not mean the 
entire sequence of cost vectors so far (which would be impossible to estimate well), but rather the 
sum of cost vectors so far, since this is all the full-information algorithm needs. This is achieved 
by constructing a random variable whose expectation is the true cost vector, and whose value is 
computable by the algorithm from the cost of the one randomly chosen arm. Chernoff-Hoeffding 
bounds ensure that the sum of these random variables almost surely converges to the sum of the 
true cost vectors. 
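
The estimation step described above can be illustrated with a short sketch. This is not code from the paper: the function names `estimate_cost_vector` and `expected_estimate` are hypothetical, and the estimate used here (observed cost divided by the probability of the chosen arm, placed in that arm's coordinate) is assumed because it matches the increments of the form c/p that appear in the analysis.

```python
import random


def estimate_cost_vector(costs, probs, rng):
    """Pull one arm at random and return a cost-vector estimate.

    Only costs[arm] is read for the pulled arm, mimicking bandit feedback.
    The estimate is zero everywhere except the pulled coordinate, which
    holds costs[arm] / probs[arm].
    """
    arm = rng.choices(range(len(costs)), weights=probs, k=1)[0]
    estimate = [0.0] * len(costs)
    estimate[arm] = costs[arm] / probs[arm]
    return estimate


def expected_estimate(costs, probs):
    """Exact expectation of the estimate, coordinate by coordinate.

    Coordinate j equals probs[j] * (costs[j] / probs[j]) = costs[j],
    so the estimate is unbiased for the true cost vector.
    """
    return [p * (c / p) for c, p in zip(costs, probs)]
```

The second function makes the unbiasedness explicit: whatever the sampling distribution, as long as every probability is positive, the expectation recovers the true cost vector.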

Unfortunately, when the probability of choosing a particular arm becomes small, the variance 
in the estimate of that arm's cost becomes large, which hurts the convergence rate for the sum. 
On the other hand, in order to approach the performance of the best arm, the algorithm needs to 
decrease its probabilities for choosing arms of higher cost. The Exp3 algorithm of Auer et al. 
balances these considerations by establishing a mixed "minimum exploration rate" of $\gamma/K$, but
using the multiplicative weights rule to allocate the remaining $1 - \gamma$ probability.
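
To make the variance issue concrete, the following sketch computes the exact variance of a single-coordinate estimate that equals cost/prob when the arm is pulled (probability prob) and 0 otherwise. The function name `estimate_variance` is hypothetical, and the estimator form is an assumption matching the c/p increments used in the analysis.

```python
def estimate_variance(cost, prob):
    """Variance of X, where X = cost / prob with probability prob, else 0.

    E(X) = cost and E(X^2) = cost^2 / prob, so the variance is
    cost^2 * (1 / prob - 1), which blows up as prob tends to 0.
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must lie in (0, 1]")
    second_moment = prob * (cost / prob) ** 2
    mean = cost
    return second_moment - mean ** 2
```

For a unit cost, exploring at rate 1/2 gives variance 1, while exploring at rate 1/100 gives variance 99; this is the trade-off that a minimum exploration rate is meant to control.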

Our algorithm instead has, for each arm, a sliding minimum exploration rate, which starts 
at approximately 1/K, and may decrease to $O(1/\sqrt{T})$, depending on the costs encountered. This
minimum exploration rate is enforced by means of an account $A_j$, for each arm j. Roughly speaking,
the algorithm fills up account $A_j$ with "negative regret" for its performance relative to arm j. The
minimum exploration rate g(Aj) is defined in such a way that the cumulative regret caused by the 
variance due to exploring at rate g(Aj) over the remaining rounds will almost surely be less than 
the negative regret already stored in Aj. 

Put another way, when the exploration probabilities are all above the minimum rates, the 
algorithm updates these probabilities using the usual multiplicative updates rule. If this rule 
results in a probability dropping too low, then instead of decreasing the exploration probability 
further, the algorithm instead adds the estimated cost vector into the account for that arm, and 
keeps the exploration probability the same. This has the result of initially elevating the exploration 
probabilities, but eventually letting them drop, for arms that perform consistently badly. 

2 The model 

We model the problem as a two-player zero-sum game between a gambler and a (rigged) casino. The 
number of rounds T is fixed in advance, and known to both players. In each round i, the gambler
chooses one arm $M^i \in \{1, \dots, K\}$, while the casino simultaneously chooses costs $c^i_1, \dots, c^i_K$ in some
fixed bounded interval, which for notational convenience we take to be [0, 1]. After these choices
are made, the casino is informed of the gambler's choice $M^i$, and the gambler is informed of the
cost $c^i_{M^i}$ for his chosen arm. At the end of the game, the gambler's loss is defined as $\sum_{i=1}^{T} c^i_{M^i}$.
The gambler's regret is defined as the difference

\[ R = \sum_{i=1}^{T} c^i_{M^i} - \min_{j} \sum_{i=1}^{T} c^i_j. \]

Since we are interested in minimizing regret, we may think of R as the payout to the casino, and
$-R$ as the payout to the gambler.
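
As a small illustration of this definition, the sketch below computes the regret from a full table of costs and the list of arms actually pulled. The function name `regret` and the list-of-lists layout are assumptions made for the example; arms are indexed from 0 here rather than from 1.

```python
def regret(costs, choices):
    """Regret of a sequence of pulls against the best fixed arm in hindsight.

    costs[i][j] is the cost of arm j in round i (all rounds, all arms);
    choices[i] is the arm pulled in round i.
    """
    num_arms = len(costs[0])
    loss = sum(row[arm] for row, arm in zip(costs, choices))
    best_fixed = min(sum(row[j] for row in costs) for j in range(num_arms))
    return loss - best_fixed
```

For example, with costs `[[0, 1], [1, 0], [0, 1]]`, the best fixed arm has total cost 1; pulling arms 1, 0, 1 incurs loss 3 and hence regret 2.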



3 The Algorithm 

Let $S \subset \mathbb{R}^K$ denote the simplex of probability distributions over $\{1, \dots, K\}$. Our algorithm is
defined in terms of two functions $f : \mathbb{R}^K \to S$ and $g : \mathbb{R}_{\ge 0} \to [0, 1]$. The boldface variables are
vectors in $\mathbb{R}^K$.

Algorithm 3.1: Accounts(f, g)

    C := A := 0.
    for i := 1 to T
        Set p = (p_1, ..., p_K) = f(C).
        Sample M = M^i from 1, ..., K according to the distribution p.
        Pull arm M. Observe and incur cost c^i_M.
        if g(A_M) < p_M
            then C_M := C_M + c^i_M / p_M
            else A_M := A_M + c^i_M / p_M



Henceforth, we will work with the following specific choice of f. Let $\eta = \sqrt{\ln K / (TK)}$. For
$z = (z_1, \dots, z_K) \in \mathbb{R}^K$, and $j \in \{1, \dots, K\}$, let

\[ f_j(z) = \frac{e^{-\eta z_j}}{\sum_{\ell=1}^{K} e^{-\eta z_\ell}}. \]



We define our barrier function g by

\[ g(x) = \max\left\{ \eta,\ \frac{1}{K (1 + x/\theta)^{3/2}} \right\}, \]

where $\theta = \sqrt{KT \ln K}$. As will be fairly easily seen from our proof, if one's goal is only to derive an
$O(\sqrt{T})$ bound on expected regret, the exponent 3/2 may be replaced by any other value strictly
between 1 and 2.
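
The algorithm and the specific choices of f and g can be put together in a short simulation sketch. This is an illustrative reading of Algorithm 3.1, not a reference implementation: the names `make_f`, `make_g`, `accounts` and the callback `cost_fn` (a stand-in for the casino, which sees only the history of earlier pulls) are assumptions, arms are indexed from 0, and K must be at least 2 so that ln K is positive.

```python
import math
import random


def make_f(eta):
    """Multiplicative-weights map f: arm j gets weight exp(-eta * C[j])."""
    def f(C):
        shift = min(C)  # shifting by a constant leaves f unchanged; avoids underflow
        weights = [math.exp(-eta * (c - shift)) for c in C]
        total = sum(weights)
        return [w / total for w in weights]
    return f


def make_g(eta, theta, K):
    """Barrier function g(x) = max(eta, 1 / (K * (1 + x / theta)^(3/2)))."""
    def g(x):
        return max(eta, 1.0 / (K * (1.0 + x / theta) ** 1.5))
    return g


def accounts(cost_fn, K, T, rng):
    """Run the Accounts update rule for T rounds on K >= 2 arms.

    cost_fn(i, history) must return the K costs in [0, 1] for round i,
    given only the arms pulled in earlier rounds (hypothetical interface).
    Returns (total cost incurred, list of pulls, final C, final A).
    """
    eta = math.sqrt(math.log(K) / (T * K))
    theta = math.sqrt(K * T * math.log(K))
    f, g = make_f(eta), make_g(eta, theta, K)
    C = [0.0] * K  # estimated cumulative costs
    A = [0.0] * K  # accounts
    history, total = [], 0.0
    for i in range(T):
        p = f(C)
        costs = cost_fn(i, history)  # fixed before this round's pull is revealed
        M = rng.choices(range(K), weights=p, k=1)[0]
        c = costs[M]                 # only the pulled arm's cost is used
        total += c
        if g(A[M]) < p[M]:
            C[M] += c / p[M]         # above the barrier: usual multiplicative update
        else:
            A[M] += c / p[M]         # at the barrier: charge the account instead
        history.append(M)
    return total, history, C, A
```

A call such as `accounts(lambda i, h: [0.0, 1.0], 2, 2000, random.Random(1))` runs the rule against a fixed (non-adaptive) cost sequence; an adaptive adversary would make `cost_fn` depend on `history`.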

The innovative part of our algorithm is the introduction of the "accounts" vector A, as well as
the "moving barriers" $g(A_j)$. Without this device, some of the exploration probabilities could be
made too small, resulting in too-high variance for the behavior of the algorithm compared to the
arms of small probability. The basic idea is that the barriers enforce a lower bound on exploration
probabilities for a given arm until the corresponding account has accumulated enough "negative
regret" compared to that arm to "pay in advance" for the higher variance that may result after the
barrier is lowered. Fortunately for us, the barriers can be lowered quite quickly and still achieve
this goal, so that all of the barrier-created exploration combined is highly unlikely to result in more
than $O(\sqrt{TK})$ total cost to the algorithm.



3.1 Conventions 

In our analysis, we will use a compact notation to refer to the values taken on by the variables of 
our algorithm during the sequence of trials. For $0 \le i \le T$, a superscript i will indicate the value
of the variable at the end of round i. Thus for instance, $p^i_j = f_j(C^{i-1})$ for $1 \le i \le T$. We consider
all variables to have value 0 at the end of round 0.

For $j \in \{1, \dots, K\}$, $i \in \{0, \dots, T\}$, let $R^i_j$ denote "regret with respect to arm j at time i"
defined as

\[ R^i_j := \sum_{t=1}^{i} \left( c^t_{M^t} - c^t_j \right). \]

Note that this gives us a new formula for the final regret, R, namely,

\[ R = \max_j R^T_j. \]

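For $j \in \{1, \dots, K\}$, let $\Phi_j$ denote the following function from $\mathbb{R}^K$ to $\mathbb{R}$:

\[ \Phi_j(z) := \frac{1}{\eta} \ln \frac{1}{f_j(z)} = \frac{1}{\eta} \ln \frac{\sum_{\ell=1}^{K} e^{-\eta z_\ell}}{e^{-\eta z_j}}. \]

This definition implies that, for each j, $\nabla \Phi_j = e_j - f$, i.e., for each $\ell$, j,

\[ \frac{\partial \Phi_j}{\partial z_\ell}(z) = \begin{cases} 1 - f_\ell(z) & \text{if } \ell = j \\ -f_\ell(z) & \text{otherwise.} \end{cases} \]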
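
The gradient identity for the potential can be sanity-checked numerically. The sketch below is only an illustration: the helper names `f`, `phi`, `grad_phi` and `numeric_grad` are hypothetical, `eta` is passed explicitly rather than derived from T and K, and arms are indexed from 0.

```python
import math


def f(z, eta):
    """Multiplicative-weights distribution: f_j(z) proportional to exp(-eta * z_j)."""
    shift = min(z)
    weights = [math.exp(-eta * (x - shift)) for x in z]
    total = sum(weights)
    return [w / total for w in weights]


def phi(z, j, eta):
    """Potential Phi_j(z) = (1 / eta) * ln(1 / f_j(z))."""
    return math.log(1.0 / f(z, eta)[j]) / eta


def grad_phi(z, j, eta):
    """Closed-form gradient e_j - f(z)."""
    p = f(z, eta)
    return [(1.0 if l == j else 0.0) - p[l] for l in range(len(z))]


def numeric_grad(z, j, eta, h=1e-6):
    """Central-difference approximation of the gradient of Phi_j at z."""
    grad = []
    for l in range(len(z)):
        up, down = list(z), list(z)
        up[l] += h
        down[l] -= h
        grad.append((phi(up, j, eta) - phi(down, j, eta)) / (2.0 * h))
    return grad
```

Comparing `grad_phi` with `numeric_grad` at a few points should show agreement up to finite-difference error, and `phi` evaluated at the zero vector equals ln K divided by eta.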

The motivation for defining $\Phi_j$ is that it acts as a potential function, controlling the change in
$R_j$. This potential function is used in standard proofs of regret bounds for the weighted majority
algorithm and variants. Let $\Phi^i_j = \Phi_j(C^i)$, where $C^i$ is the estimate for the sum of the costs after
round i. One can easily see that $\Phi^0_j = \frac{1}{\eta} \ln K$ and $\Phi^T_j \ge 0$. Thus the decrease $\Phi^0_j - \Phi^T_j$ in potential
over the entire game is bounded above by $\frac{1}{\eta} \ln K = \sqrt{TK \ln K}$.

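We will use $\Delta$ to denote the difference operator. Specifically, for $j \in [K]$ and $i \in [T]$, we will
denote

\[ \Delta R^i_j := R^i_j - R^{i-1}_j, \qquad \Delta \Phi^i_j := \Phi^i_j - \Phi^{i-1}_j, \qquad \Delta A^i_j := A^i_j - A^{i-1}_j. \]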

Within the context of our proof of Theorem 1.1, for a particular value of a, we will restrict our
attention to a fixed adaptive adversary, whose strategy maximizes $\Pr(R > a)$. This adversary may
be assumed to be deterministic, in the sense that each cost function $c^i$ is a deterministic function
of the previous decisions $M^1, \dots, M^{i-1}$ made by the algorithm. We will denote by $\mathcal{H}_i$ the "history
of the game prior to round i," that is, the $\sigma$-algebra generated by $M^1, \dots, M^{i-1}$.
We also introduce the notations

\[ Y^i_j := \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j - E\left( \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j \mid \mathcal{H}_i \right), \qquad Y_j := \sum_{i=1}^{T} Y^i_j. \]

Note that the definition of $Y_j$ depends on the particular adversary in question. Also note that $Y^i_j$
is a martingale difference sequence.

We will use $e_1, \dots, e_K$ to denote the standard basis for $\mathbb{R}^K$.

4 Outline of the Proof 

The proof of our theorem splits into two main parts. First, we prove a weakly exponential tail
inequality for the random variable $Y_j - A_j$, where $A_j$ denotes the final account value $A^T_j$.
Essentially, this says that the account value $A_j$ rarely underestimates by much the contribution
of "stepwise variance" to $R_j$.

Lemma 4.1. Let $1 \le j \le K$. Then, for every $a \ge 1$,

\[ \Pr\left( Y_j - A_j > (a + 1)\sqrt{TK \ln K} \right) \le \left( \frac{16 \sqrt{a}}{\ln K} + \frac{128}{\ln^2 K} \right) \exp\left( -\frac{\sqrt{a}\, \ln K}{8} \right). \]

Second, we will prove that the contribution of "stepwise expectations" to $R_j + A_j$ exhibits an
even sharper cut-off at $O(\sqrt{TK \log K})$.

Lemma 4.2. 

' -3\/TKlnK\ 
In order to prove these lemmas, we will need some results about martingales which we will
describe in Section 5. We will then prove Lemmas 4.1 and 4.2 in Sections 6 and 7 respectively. In
the remainder of this section we will show how Theorem 1.1 follows from the two lemmas.

Proof of Theorem 1.1. Since R cannot exceed T, we may assume without loss of generality that
$(a + 7)\sqrt{TK \ln K} \le T$; otherwise, there is nothing to prove. Fix an arm j. By the definition of $Y_j$,
we have

\[ Y_j = \sum_{i=1}^{T} Y^i_j = R^T_j - R^0_j + \Phi^T_j - \Phi^0_j + A^T_j - A^0_j - \sum_{i=1}^{T} E\left( \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j \mid \mathcal{H}_i \right). \]

Since $R^0_j = A^0_j = 0$ and $\Phi^0_j - \Phi^T_j \le \Phi_j(0) = \frac{\ln K}{\eta} = \sqrt{TK \ln K}$, this implies, writing $R_j = R^T_j$ and $A_j = A^T_j$,

\[ R_j \le Y_j - A_j + \sum_{i=1}^{T} E\left( \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j \mid \mathcal{H}_i \right) + \sqrt{TK \ln K}. \]

Suppose the high probability bound of Lemma 4.2 holds, and that, for every $1 \le j \le K$, the
high probability bound of Lemma 4.1 holds for $Y_j - A_j$. Then, summing, we obtain the desired
bound on the regret,

\[ R \le \max_j R_j \le (a + 7)\sqrt{TK \ln K}. \]

Summing the error probabilities completes the proof for the tail inequality, 

„ /„ / „s rr^77\ T A „(Vayfa 128 \ / JahxK\ f -3^/TK\nK\ 
P r (i J >( a + 7)vWl^)<j r ^ + i -^jexp^^ 1 — j +<B p^ - j. 

Noting that the second term is dominated by the first term, and approximating crudely using the
facts $a \ge 1$ and $K \ge 2$, we can infer that

\[ \Pr\left( R > (a + 7)\sqrt{TK \ln K} \right) \le 1000\, K \sqrt{a}\, \exp\left( -\frac{\sqrt{a}\, \ln K}{8} \right). \]

To prove the upper bound on expectation, we note that, in general,

\[ E(R) \le E(\max\{R, 0\}) = \int_0^\infty \Pr(R > x)\, dx. \]

Using the trivial bound $\Pr(R > x) \le 1$ for small values of x, and our main tail inequality for larger
values of x, we deduce, after several steps, the desired estimate. □

5 Concentration Inequalities for Martingales 

The well-known Hoeffding-Azuma inequality bounds the probability of large deviation for a
martingale with uniformly bounded step sizes. In our analysis, we will need rather tight bounds on the
deviation probabilities for a martingale whose step sizes are bounded by $1/p^i_j$, which is a random
variable. Fortunately, the quality of the bound attained is allowed to depend on the step sizes
actually encountered.

The following strong generalization of the Hoeffding-Azuma inequality, due to McDiarmid [10,
Theorem 3.15], does essentially what we want. In order to state his bound, we need to first introduce
some concepts related to the notion of conditional expectation.

Recall that, for random variables A and B over a finite probability space $\Omega$, the conditional
expectation $E(A \mid B)$ is the random variable which, on each atom $\{B = b\}$ of B, takes value
$E(A \mid B = b)$, which is the expectation of A in the corresponding restricted probability space.
Analogously, we define the conditional variance, $\mathrm{Var}(A \mid B)$, to be the random variable which,
on each atom $\{B = b\}$ of B, takes value $\mathrm{Var}(A \mid B = b)$, defined as the variance of A in the
corresponding restricted probability space. Again analogously, we define the conditional positive
deviation, $\sup(A \mid B)$, to be the random variable which, on each atom $\{B = b\}$ of B, takes the
maximum value attained by A on that subset of $\Omega$. More generally, this is the minimum among all
B-measurable random variables that are always $\ge A$. As usual, if $B = (B_1, \dots, B_m)$ is a tuple of
random variables, we will also write $\mathrm{Var}(A \mid B)$ as $\mathrm{Var}(A \mid B_1, \dots, B_m)$, and also $\sup(A \mid B)$ as
$\sup(A \mid B_1, \dots, B_m)$.






Theorem 5.1 (McDiarmid). Suppose $X_1, \dots, X_n$ is a martingale difference sequence, and b is
a uniform upper bound on the steps $X_i$. Let V denote the sum of conditional variances,

\[ V = \sum_{i=1}^{n} \mathrm{Var}(X_i \mid X_1, \dots, X_{i-1}). \]

Then, for every $a, v > 0$,

\[ \Pr\left( \sum_i X_i \ge a \text{ and } V \le v \right) \le \exp\left( \frac{-a^2}{2v + 2ab/3} \right). \]



We will also need the following more general formulation, which is an easy consequence.

Theorem 5.2. Suppose $X_1, \dots, X_n$, V are as in Theorem 5.1. Let B denote the maximum
"conditional positive deviation,"

\[ B = \max_i \sup(X_i \mid X_1, \dots, X_{i-1}). \]

Then, for every $a, b, v > 0$,

\[ \Pr\left( \sum_i X_i \ge a \text{ and } V \le v \text{ and } B \le b \right) \le \exp\left( \frac{-a^2}{2v + 2ab/3} \right). \]

Proof. Define $X^*_1, \dots, X^*_n$ inductively by setting $X^*_i = X_i$ unless $\sup(X_i \mid X_1, \dots, X_{i-1}) > b$; in
that case, set $X^*_i = \dots = X^*_n = 0$. It is easily verified that $X^*$ is also a martingale difference
sequence, and that b is an absolute upper bound on the step sizes for $X^*$. Moreover, $X^*$ behaves
exactly like X except when $B > b$. Applying this together with Theorem 5.1 implies

\[ \Pr\left( \sum_i X_i \ge a \text{ and } V \le v \text{ and } B \le b \right) \le \Pr\left( \sum_i X^*_i \ge a \text{ and } V^* \le v \right) \le \exp\left( \frac{-a^2}{2v + 2ab/3} \right), \]

where $V^*$ denotes $\sum_i \mathrm{Var}(X^*_i \mid X^*_1, \dots, X^*_{i-1})$. □
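
To get a feel for the bound in Theorems 5.1 and 5.2, the following sketch evaluates the right-hand side and compares it with the classical Hoeffding-Azuma bound $\exp(-a^2 / (2 n b^2))$ for steps bounded by b in absolute value. The function names are hypothetical, and the numbers in the usage note are chosen only for illustration.

```python
import math


def mcdiarmid_tail(a, v, b):
    """Right-hand side exp(-a^2 / (2v + 2ab/3)) of Theorems 5.1 and 5.2.

    a: deviation threshold, v: bound on the sum of conditional variances,
    b: bound on the (positive) step size.
    """
    if a <= 0 or v <= 0 or b <= 0:
        raise ValueError("a, v and b must be positive")
    return math.exp(-a * a / (2.0 * v + 2.0 * a * b / 3.0))


def hoeffding_azuma_tail(a, n, b):
    """Classical one-sided Hoeffding-Azuma bound for n steps with |X_i| <= b."""
    if a <= 0 or n <= 0 or b <= 0:
        raise ValueError("a, n and b must be positive")
    return math.exp(-a * a / (2.0 * n * b * b))
```

With n = 1000 steps bounded by 1 but total conditional variance at most 10, the variance-sensitive bound at a = 30 is exp(-22.5), whereas Hoeffding-Azuma gives only exp(-0.45); this gap is why a variance-based inequality is useful when most steps are small.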



6 Charging for variance: Proof of Lemma 4.1

In this section, we prove Lemma 4.1, our tail inequality for $Y_j - A_j$. The key ingredient will be
Lemma 6.6, which is a tail inequality for the "rectangular" events $\{Y_j \ge \zeta \text{ and } A_j \le \xi\}$. First,
however, we need to prove several basic facts, which will be used to prove upper bounds on the
conditional variance and conditional positive deviation of the steps $Y^i_j$, in terms of the final account
value $A_j$.

We first show that the probability of choosing arm j cannot change drastically from one round
to the next.

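Proposition 6.1. Let $1 \le i \le T$ and $1 \le j \le K$. If arm j is chosen in round i, then

\[ e^{-\eta/p^i_j} \le \frac{p^{i+1}_j}{p^i_j} \le 1. \qquad (1) \]

Otherwise, if $M = M^i \ne j$ is the arm chosen in round i, then

\[ 1 \le \frac{p^{i+1}_j}{p^i_j} \le e^{\eta c^i_M}. \qquad (2) \]

Proof. Suppose $C^{i-1} = (z_1, \dots, z_K)$. Since $f_j$ is decreasing in its j'th component, and increasing
in every other component, the inequalities involving 1 follow. Next, observe that when arm j is
chosen,

\[ \frac{p^{i+1}_j}{p^i_j} = \frac{e^{-\eta(z_j + c^i_j/p^i_j)}}{e^{-\eta z_j}} \cdot \frac{\sum_\ell e^{-\eta z_\ell}}{\sum_\ell e^{-\eta(z_\ell + \delta_{\ell,j}\, c^i_j/p^i_j)}} \ge e^{-\eta c^i_j/p^i_j} \ge e^{-\eta/p^i_j}, \]

since the sum in the denominator of the second factor is at most the sum in its numerator, and $c^i_j \le 1$.

When any other arm M is chosen,

\[ \frac{p^{i+1}_j}{p^i_j} = \frac{\sum_\ell e^{-\eta z_\ell}}{\sum_\ell e^{-\eta(z_\ell + \delta_{\ell,M}\, c^i_M/p^i_M)}} = \frac{1}{1 - p^i_M \left( 1 - e^{-\eta c^i_M/p^i_M} \right)} \le e^{\eta c^i_M}, \]

where the last inequality holds for every $0 < p^i_M \le 1$ and $\eta c^i_M \ge 0$, with equality at $\eta c^i_M = 0$, as
can be seen by examining the first partial derivative in $x = \eta c^i_M$. □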


Next we bound changes in regret, potential, and account value in terms of the arm probability.

Proposition 6.2. Let $1 \le i \le T$ and $1 \le j \le K$. If arm j is chosen in round i, then

\[ 0 \le \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j \le 1/p^i_j. \]

When any other arm M is chosen,

\[ -1 \le \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j \le 0. \]

Proof. More precisely, we show that if arm j is chosen in round i, then $\Delta R^i_j = 0$ and both $\Delta \Phi^i_j$ and
$\Delta A^i_j$ are between 0 and $1/p^i_j$, with at most one of them nonzero. When an arm $M \ne j$ is chosen in
round i, we show that $\Delta A^i_j = 0$, $\Delta R^i_j = c^i_M - c^i_j$ and $-c^i_M \le \Delta \Phi^i_j \le 0$.

The claims for $\Delta R^i_j$ are immediate from the definition. The bounds on $\Delta A^i_j$ follow because $A_j$ is
only incremented when arm j is chosen (and moreover $g(A^{i-1}_j) \ge p^i_j$), and then $\Delta A^i_j = c^i_j/p^i_j \le 1/p^i_j$.
To see the bounds on $\Delta \Phi^i_j$, note that, by the definition of $\Phi_j$,

\[ \Delta \Phi^i_j = \frac{1}{\eta} \ln \frac{p^i_j}{p^{i+1}_j}. \]

Combining this with Proposition 6.1 completes the proof. □

As stated in the Introduction, the purpose of the accounts is to prevent the algorithm from
overreacting to observed costs and allowing the corresponding probabilities to decrease too quickly.
The next proposition states that, indeed, for each j, $g(A_j)$ acts as an approximate lower bound
on the probability of choosing arm j. In particular, the exploration probabilities never drop below
$\Omega(1/\sqrt{T})$. In the Exp3 algorithm of Auer et al. [2], the latter property was enforced by using a
modified version of f.

Proposition 6.3. For $1 \le i \le T$ and $1 \le j \le K$,

\[ e^{\eta/g(A^{i-1}_j)}\, p^i_j \ge g(A^{i-1}_j) \ge g(A_j). \]

Proof. Since g is decreasing and $A^i_j$ is increasing in i, we have $g(A^{i-1}_j) \ge g(A_j)$. We prove by
induction on i that $p^i_j \ge g(A^{i-1}_j)\, e^{-\eta/g(A^{i-1}_j)}$. The base case i = 1 is clear, since $p^1_j = f_j(0) = 1/K \ge g(0)$.

For the inductive step, we consider two cases. If $p^{i+1}_j \ge p^i_j$, then by inductive hypothesis we
have

\[ p^{i+1}_j \ge p^i_j \ge g(A^{i-1}_j)\, e^{-\eta/g(A^{i-1}_j)} \ge g(A^i_j)\, e^{-\eta/g(A^i_j)}, \]

where the last inequality follows because $A^i_j \ge A^{i-1}_j$ and g is non-increasing.

On the other hand, if $p^{i+1}_j < p^i_j$, then note that j must be the arm chosen in round i and
moreover $p^i_j > g(A^{i-1}_j)$. This also implies that $A^i_j = A^{i-1}_j$. It follows by Proposition 6.1 that

\[ p^{i+1}_j \ge p^i_j\, e^{-\eta/p^i_j} \ge g(A^{i-1}_j)\, e^{-\eta/g(A^{i-1}_j)} = g(A^i_j)\, e^{-\eta/g(A^i_j)}. \] □

Next, we show how to derive good upper bounds on the conditional variances and conditional
positive deviations for our martingale steps $Y^i_j$.

Proposition 6.4. For all i, $Y^i_j \le e^{\eta/g(A_j)} / g(A_j)$.

Proof. Let $Z^i_j$ denote $\Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j$. Note that by definition $Y^i_j = Z^i_j - E(Z^i_j \mid \mathcal{H}_i)$. Let
$\mu = \sup(Z^i_j \mid \mathcal{H}_i)$ denote the conditional positive deviation of $Z^i_j$ given $\mathcal{H}_i$. By Proposition 6.2, $Z^i_j$
attains value $\mu$ when arm j is chosen, which occurs with probability $p^i_j$, and moreover $\mu \le 1/p^i_j$,
and when any arm is chosen, $Z^i_j \ge -1$. It follows that

\[ E(Z^i_j \mid \mathcal{H}_i) \ge p^i_j\, \mu - (1 - p^i_j). \]

From this we infer

\[ Y^i_j = Z^i_j - E(Z^i_j \mid \mathcal{H}_i) \le \mu - E(Z^i_j \mid \mathcal{H}_i) \le (1 - p^i_j)\mu + (1 - p^i_j) \le 1/p^i_j - p^i_j \le 1/p^i_j. \]

By Proposition 6.3 it follows that

\[ Y^i_j \le 1/p^i_j \le e^{\eta/g(A^{i-1}_j)} / g(A^{i-1}_j) \le e^{\eta/g(A_j)} / g(A_j). \] □

Proposition 6.5. For $1 \le j \le K$,

\[ \sum_{i=1}^{T} \mathrm{Var}(Y^i_j \mid \mathcal{H}_i) \le T\left( 1 + e^{\eta/g(A_j)} / g(A_j) \right). \]

Proof. Let $1 \le i \le T$, and condition on $\mathcal{H}_i$, which in particular determines $p = p^i_j$. By
Proposition 6.2 we know that $Z = \Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j$ is between 0 and 1/p with probability p, and
otherwise between -1 and 0. Since, for any random variable X, $\mathrm{Var}(X) \le E(X^2)$, this implies
the conditional variance $\mathrm{Var}(Z \mid \mathcal{H}_i) \le p(1/p)^2 + (1 - p)(-1)^2 \le 1/p + 1$. Since $Y^i_j$ equals Z minus
its conditional expectation, it has the same conditional variance. Thus $\mathrm{Var}(Y^i_j \mid \mathcal{H}_i) \le 1 + 1/p$.
By Proposition 6.3 this is at most $1 + e^{\eta/g(A_j)} / g(A_j)$. Summing over i completes the proof. □

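Next we prove a tail inequality for the "rectangular" events $\{Y_j \ge \zeta \text{ and } A_j \le \xi\}$. The key
insight is that, via Propositions 6.4 and 6.5, we can apply McDiarmid's inequality to these events.

Lemma 6.6. For all $\zeta, \xi > 0$ and $1 \le j \le K$,

\[ \Pr(Y_j \ge \zeta \text{ and } A_j \le \xi) \le \exp\left( \frac{-\zeta^2 g(\xi)}{6T + 2\zeta} \right). \]

Proof. Fix j. By definition, $Y^1_j, \dots, Y^T_j$ is a martingale difference sequence with respect to the
filtration $\mathcal{H}_1, \dots, \mathcal{H}_T$. Let V denote the sum of conditional variances, $V = \sum_{i=1}^{T} \mathrm{Var}(Y^i_j \mid \mathcal{H}_i)$,
and let B denote the maximum conditional positive deviation,

\[ B = \max_i \sup(Y^i_j \mid \mathcal{H}_i). \]

By Propositions 6.4 and 6.5 we know $B \le e^{\eta/g(A_j)} / g(A_j)$ and $V \le T(1 + e^{\eta/g(A_j)} / g(A_j))$.
Consequently, since g is decreasing, for every $\zeta, \xi$, we know

\[ \Pr(Y_j \ge \zeta \text{ and } A_j \le \xi) \le \Pr\left( Y_j \ge \zeta \text{ and } V \le T\left(1 + \frac{e^{\eta/g(\xi)}}{g(\xi)}\right) \text{ and } B \le \frac{e^{\eta/g(\xi)}}{g(\xi)} \right). \]

Applying Theorem 5.2 (McDiarmid's inequality), this implies

\[ \Pr(Y_j \ge \zeta \text{ and } A_j \le \xi) \le \exp\left( \frac{-\zeta^2 g(\xi)}{2T\left(g(\xi) + e^{\eta/g(\xi)}\right) + 2 e^{\eta/g(\xi)} \zeta / 3} \right). \]

To simplify the denominator of the exponent on the right-hand side, note that it is a decreasing
function of $g(\xi)$ and that by definition $g(\xi) \ge \eta$. Thus,

\[ 2T\left(g(\xi) + e^{\eta/g(\xi)}\right) + 2 e^{\eta/g(\xi)} \zeta / 3 \le 2T(\eta + e) + 2e\zeta/3 < 6T + 2\zeta, \quad \text{since } \eta + e < 3. \]

Plugging this into the upper bound on $\Pr(Y_j \ge \zeta \text{ and } A_j \le \xi)$ completes the proof. □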
Now we are ready to prove the main result of the section, Lemma 4.1. The proof combines
Lemma 6.6 with a covering argument. We remark that, up to this point, the only properties of g
which have been used in an essential way are that it is decreasing and has minimum value $\eta$. Now
we will begin to make use of the specific definition of g.

Proof of Lemma 4.1. Let $A_{crit} := \theta((\eta K)^{-2/3} - 1)$, which is the minimum A such that $g(A) = \eta$.
Let $A_{max}$ denote the absolute maximum possible value of $A^T_j$. (Propositions 6.1 and 6.2 imply
explicit bounds on $A_{max}$, but we will not need these.) Let $\nu = (a + 1)\sqrt{TK \ln K}$.

Our approach is to cover the event $\{Y_j \ge A_j + \nu\}$ by rectangles of the form $\{Y_j \ge \zeta \text{ and } A_j \le \xi\}$.
More specifically, for $\ell \ge 1$, let $\zeta_\ell = (a + \ell)\theta = \nu + (\ell - 1)\theta$ and

\[ \xi_\ell = \begin{cases} \ell\theta & \text{if } \ell\theta < A_{crit} \\ A_{max} & \text{otherwise.} \end{cases} \]

Then the $\lceil A_{crit}/\theta \rceil$ rectangles $\{Y_j \ge \zeta_\ell \text{ and } A_j \le \xi_\ell\}$, for $1 \le \ell \le \lceil A_{crit}/\theta \rceil$, cover the trapezoid
$\{Y_j \ge A_j + \nu \text{ and } A_j \le A_{max}\}$, which equals the event $\{Y_j - A_j \ge \nu\}$.

Claim. For the above definition of $\zeta_\ell, \xi_\ell$,

\[ \frac{\zeta_\ell^2\, g(\xi_\ell)}{6T + 2\zeta_\ell} \ge \frac{(a + \ell)^{1/2} \ln K}{8}. \]

Proof of Claim. We will show that the ratio

\[ r := \frac{\zeta_\ell^2\, g(\xi_\ell)}{(6T + 2\zeta_\ell)(a + \ell)^{1/2}} \ge \frac{\theta^2}{K(6T + 2(\theta + A_{crit}))}. \]

First, observe that r is an increasing function of a, and hence we may assume a = 1. In the case
when $\ell\theta < A_{crit}$, this gives us

\[ r = \frac{\theta^2}{K(6T + 2(1 + \ell)\theta)} \ge \frac{\theta^2}{K(6T + 2(\theta + A_{crit}))}. \]

In the case when $\ell\theta \ge A_{crit}$, we have

\[ r = \frac{(1 + \ell)^{3/2}\, \theta^2\, \eta}{6T + 2(1 + \ell)\theta}, \]

which differentiation shows is an increasing function of $\ell$, and is hence minimized when $\ell = A_{crit}/\theta$.
Substituting this, and using the fact that $g(A_{crit}) = \eta$, we have

\[ r \ge \frac{(1 + A_{crit}/\theta)^{3/2}\, \theta^2\, \eta}{6T + 2(\theta + A_{crit})} = \frac{\theta^2}{K(6T + 2(\theta + A_{crit}))}. \]

Next, observe that, since we are assuming $T \ge \sqrt{TK \ln K}$, it follows that

\[ \theta + A_{crit} = T^{5/6} (K \ln K)^{1/6} \le T. \]

Since $\theta^2 = TK \ln K$, this completes the proof of the Claim.
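
The Claim can also be spot-checked numerically. The sketch below is only a sanity check under stated assumptions, not part of the proof: the function name `claim_holds` is hypothetical, the parameters must satisfy the standing assumption T >= K ln K (equivalently $T \ge \sqrt{TK \ln K}$), and in the case $\ell\theta \ge A_{crit}$ the code uses the value $\eta$ for $g(\xi_\ell)$, which is the smallest value g can take.

```python
import math


def claim_holds(a, ell, T, K):
    """Check zeta^2 * g(xi) / (6T + 2*zeta) >= sqrt(a + ell) * ln(K) / 8."""
    ln_k = math.log(K)
    theta = math.sqrt(K * T * ln_k)
    eta = math.sqrt(ln_k / (T * K))
    a_crit = theta * ((eta * K) ** (-2.0 / 3.0) - 1.0)
    zeta = (a + ell) * theta
    if ell * theta < a_crit:
        # xi = ell * theta, so 1 + xi / theta = 1 + ell
        g_xi = max(eta, 1.0 / (K * (1.0 + ell) ** 1.5))
    else:
        # xi = A_max; g never drops below eta, so eta is the conservative value
        g_xi = eta
    lhs = zeta ** 2 * g_xi / (6.0 * T + 2.0 * zeta)
    rhs = math.sqrt(a + ell) * ln_k / 8.0
    return lhs >= rhs
```

Looping `claim_holds` over a few values of a, $\ell$, T and K (with T at least K ln K) gives a quick check that the constants 6, 2 and 8 fit together as stated.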
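Applying Lemma 6.6 in conjunction with a union bound, we have

\[
\begin{aligned}
\Pr(Y_j - A_j \ge \nu) &\le \sum_{\ell=1}^{\lceil A_{crit}/\theta \rceil} \Pr(Y_j \ge \zeta_\ell \text{ and } A_j \le \xi_\ell) && \text{by our covering argument} \\
&\le \sum_{\ell=1}^{\lceil A_{crit}/\theta \rceil} \exp\left( \frac{-\zeta_\ell^2\, g(\xi_\ell)}{6T + 2\zeta_\ell} \right) && \text{by Lemma 6.6} \\
&\le \sum_{\ell=1}^{\lceil A_{crit}/\theta \rceil} \exp\left( -\frac{(a + \ell)^{1/2} \ln K}{8} \right) && \text{by the Claim} \\
&\le \sum_{\ell=1}^{\infty} \exp\left( -\frac{(a + \ell)^{1/2} \ln K}{8} \right) \\
&\le \int_{x=a}^{\infty} \exp\left( -\frac{x^{1/2} \ln K}{8} \right) dx \\
&= \frac{128}{\ln^2 K} \int_{y = (\sqrt{a} \ln K)/8}^{\infty} y \exp(-y)\, dy \\
&= \frac{128}{\ln^2 K}\, (y + 1) \exp(-y) \Big|_{y = (\sqrt{a} \ln K)/8} \\
&= \left( \frac{16\sqrt{a}}{\ln K} + \frac{128}{\ln^2 K} \right) \exp\left( -\frac{\sqrt{a}\, \ln K}{8} \right).
\end{aligned}
\]

This completes the proof of Lemma 4.1. □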
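
The final summation step, which bounds a series of terms of the form $\exp(-\sqrt{a + \ell}\, \ln K / 8)$ by the closed form appearing in Lemma 4.1, can be checked numerically. The sketch below is illustrative only; the function names `partial_sum` and `closed_form` are hypothetical, and the series is truncated after a fixed number of terms, which can only make the left-hand side smaller.

```python
import math


def partial_sum(a, K, terms):
    """Sum of exp(-sqrt(a + l) * ln(K) / 8) for l = 1, ..., terms."""
    ln_k = math.log(K)
    return sum(math.exp(-math.sqrt(a + l) * ln_k / 8.0) for l in range(1, terms + 1))


def closed_form(a, K):
    """(16 sqrt(a) / ln K + 128 / ln^2 K) * exp(-sqrt(a) * ln(K) / 8)."""
    ln_k = math.log(K)
    prefactor = 16.0 * math.sqrt(a) / ln_k + 128.0 / ln_k ** 2
    return prefactor * math.exp(-math.sqrt(a) * ln_k / 8.0)
```

Because each term of the series is a decreasing function of $\ell$, the sum is dominated by the corresponding integral starting at a, and the closed form is exactly that integral; the comparison `partial_sum(a, K, n) <= closed_form(a, K)` should therefore hold for every n.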

7 Bounding the sum of conditional expectations: Proof of Lemma 4.2

In this section, we prove Lemma 4.2, our tail inequality for the sum of conditional expectations
$E(\Delta R^i_j + \Delta \Phi^i_j + \Delta A^i_j \mid \mathcal{H}_i)$. We will first need the following bound on the accuracy of linear
approximation to the potential function.

Lemma 7.1. For i £ [T], j £ [K], regardless of the history, 



n l )+ V K. 



E ( AS} I Hi) < V^C 4 " 1 ) • E (^AC 

Proof of Lemma 7.1. Let $i \in [T]$ be fixed. Recall that $\Delta\Phi^i_j = \Phi_j(C^i) - \Phi_j(C^{i-1})$. Once the
history $H_i$ has been fixed, $C^{i-1}$ is determined. Let $z = C^{i-1}$, and let $z + \Delta z = C^i$. Note that
$\Delta z$ is a random variable that takes value $\frac{c^i_\ell}{p^i_\ell} e_\ell$ with probability $p^i_\ell$. We want to show that

$$E(\Phi_j(z + \Delta z) - \Phi_j(z)) \le \nabla\Phi_j(z) \cdot E(\Delta z) + \eta K.$$

For any vector $v$, let $D_v$ denote the (normalized) partial differential operator in the direction of $v$.
We abbreviate $D_{e_\ell}$ by $D_\ell$. By Taylor's theorem, for some $a \in [0, 1]$,

$$\Phi_j(z + \Delta z) - \Phi_j(z) = D_{\Delta z}\Phi_j(z)\, \|\Delta z\| + \tfrac{1}{2} D^2_{\Delta z}\Phi_j(z + a\Delta z)\, \|\Delta z\|^2.$$

Taking expectations over $\Delta z$, this becomes

$$E(\Phi_j(z + \Delta z) - \Phi_j(z)) = \nabla\Phi_j(z) \cdot E(\Delta z) + \tfrac{1}{2} E\left(D^2_{\Delta z}\Phi_j(z + a\Delta z)\, \|\Delta z\|^2\right).$$

So we just need to show that the expectation of the second-order term has the right bound.
Recall that $\nabla\Phi_j = e_j - f$. Differentiating once more shows, for any vector $x$,

$$D^2_\ell \Phi_j(x) = -D_\ell f_\ell(x) = \eta f_\ell(x)(1 - f_\ell(x)) \le \eta f_\ell(x).$$

Applying this to the second-order term from Taylor's theorem yields 

K 



E D Az $j(z + aAz) ||Az|j 



K ( \ 



< 



where the last inequality follows because $f_\ell$ is decreasing in its $\ell$th component.



□ 



We now prove Lemma 4.2, our concentration bound on the sum of conditional expectations

$$\sum_i E(\Delta R^i_j + \Delta\Phi^i_j + \Delta A^i_j \mid H_i).$$



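Proof of Lemma 4.2. For $1 \le j \le K$, $1 \le i \le T$, let $b^i_j$ be the indicator variable for the event
$\{g(A^{i-1}_j) < p^i_j\}$ (i.e., the algorithm is not at the barrier for arm $j$ in round $i$). Note that $b^i_j$ is
determined by $H_i$, the history prior to time $i$. This allows us to calculate

$$E(\Delta R^i_j \mid H_i) = \sum_{\ell=1}^{K} p^i_\ell (c^i_\ell - c^i_j) = \left(\sum_{\ell=1}^{K} p^i_\ell c^i_\ell\right) - c^i_j,$$

$$E(\Delta A^i_j \mid H_i) = p^i_j (1 - b^i_j) \frac{c^i_j}{p^i_j} = (1 - b^i_j)\, c^i_j.$$

By Lemma 7.1, $E(\Delta\Phi^i_j \mid H_i) \le \nabla\Phi_j(C^{i-1}) \cdot E(\Delta C^i \mid H_i) + \eta K$. An easy calculation now shows

$$\nabla\Phi_j(C^{i-1}) \cdot E(\Delta C^i \mid H_i) = (e_j - f(C^{i-1})) \cdot \sum_{\ell=1}^{K} b^i_\ell c^i_\ell e_\ell = b^i_j c^i_j - \sum_{\ell=1}^{K} p^i_\ell b^i_\ell c^i_\ell.$$
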
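Combining the above, we have

$$E(\Delta R^i_j + \Delta\Phi^i_j + \Delta A^i_j \mid H_i) \le \sum_{\ell=1}^{K} (1 - b^i_\ell)\, c^i_\ell p^i_\ell + \eta K.$$

Consequently,

$$\sum_{i=1}^{T} E(\Delta R^i_j + \Delta\Phi^i_j + \Delta A^i_j \mid H_i) \le \sum_{i=1}^{T} \sum_{\ell=1}^{K} (1 - b^i_\ell)\, c^i_\ell p^i_\ell + \eta KT.$$

Since dropping the summands for which $p^i_\ell \le \eta$ can decrease the total by at most $\eta KT$, we can
rewrite this as

$$\sum_{i=1}^{T} E(\Delta R^i_j + \Delta\Phi^i_j + \Delta A^i_j \mid H_i) \le \sum_{(i,\ell) \in S} c^i_\ell p^i_\ell + 2\eta KT, \qquad (3)$$

where the index set $S$ is defined by

$$S := \{(i, \ell) : b^i_\ell = 0 \text{ and } p^i_\ell > \eta\}.$$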
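The per-round combination step can be checked term by term. The following worked calculation is a sketch that takes the three conditional expectations in the form $E(\Delta R^i_j \mid H_i) = \sum_\ell p^i_\ell c^i_\ell - c^i_j$, $E(\Delta A^i_j \mid H_i) = (1 - b^i_j) c^i_j$, and $E(\Delta\Phi^i_j \mid H_i) \le b^i_j c^i_j - \sum_\ell p^i_\ell b^i_\ell c^i_\ell + \eta K$; if any of these differs from the intended statement, the check should be adjusted accordingly.

```latex
\begin{align*}
E(\Delta R^i_j + \Delta\Phi^i_j + \Delta A^i_j \mid H_i)
  &\le \Big(\sum_{\ell=1}^{K} p^i_\ell c^i_\ell - c^i_j\Big)
     + \Big(b^i_j c^i_j - \sum_{\ell=1}^{K} p^i_\ell b^i_\ell c^i_\ell + \eta K\Big)
     + (1 - b^i_j)\, c^i_j \\
  &= \sum_{\ell=1}^{K} p^i_\ell c^i_\ell - \sum_{\ell=1}^{K} p^i_\ell b^i_\ell c^i_\ell + \eta K
     \qquad \text{(the three $c^i_j$ terms cancel)} \\
  &= \sum_{\ell=1}^{K} (1 - b^i_\ell)\, c^i_\ell p^i_\ell + \eta K .
\end{align*}
```

Only pairs with $b^i_\ell = 0$ contribute to the remaining sum, which is why the index set $S$ requires $b^i_\ell = 0$.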
Let $\alpha, \beta > 0$. Henceforth, our goal will be to prove an upper bound on

$$\Pr\left(\sum_{(i,\ell) \in S} c^i_\ell p^i_\ell \ge \alpha + \beta\right).$$

To this end, we will assume without loss of generality that the adversary is such that

$$\sum_{(i,\ell) \in S} c^i_\ell p^i_\ell \le \alpha + \beta,$$

in which case the probability we are trying to bound from above is

$$\Pr\left(\sum_{(i,\ell) \in S} c^i_\ell p^i_\ell = \alpha + \beta\right).$$

The intuition for why this assumption is valid is that, in those cases when the sum reaches $\alpha + \beta$,
the adversary has already succeeded in his goal, so may as well stop.

For each $(i, \ell) \in S$, let $x^i_\ell$ denote the indicator random variable for the event that arm $\ell$ is
chosen in round $i$. For $1 \le i \le T$, let

$$Z^i = \sum_{\ell : (i,\ell) \in S} c^i_\ell (p^i_\ell - x^i_\ell).$$

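Note that $Z^i$ takes values in $[-1, 1]$ and that $Z^1, \dots, Z^T$ is a martingale difference sequence. An
easy calculation shows that

$$\mathrm{Var}(Z^i \mid H_i) = \sum_{\ell : (i,\ell) \in S} (c^i_\ell)^2 p^i_\ell - \left(\sum_{\ell : (i,\ell) \in S} c^i_\ell p^i_\ell\right)^2 \le \sum_{\ell : (i,\ell) \in S} c^i_\ell p^i_\ell,$$

and hence that

$$\sum_{i=1}^{T} \mathrm{Var}(Z^i \mid Z^1, \dots, Z^{i-1}) \le \sum_{(i,\ell) \in S} c^i_\ell p^i_\ell \le \alpha + \beta.$$

By McDiarmid's inequality, Theorem 5.1, applied to $Z = \sum_{i=1}^{T} Z^i$ with $a = \alpha$, $v = \alpha + \beta$ and
$b = 1$, we have

$$\Pr(Z \ge \alpha) \le \exp\left(-\frac{\alpha^2}{2\beta + 8\alpha/3}\right).$$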
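The variance bound used above can be unpacked as follows. This is a sketch under two assumptions that are not restated at this point in the text: all costs $c^i_\ell$ lie in $[0, 1]$, and exactly one arm is chosen per round, so that $x^i_\ell x^i_m = 0$ for $\ell \ne m$ and $E(x^i_\ell \mid H_i) = p^i_\ell$. All sums range over $\ell$ with $(i, \ell) \in S$.

```latex
\begin{align*}
E(Z^i \mid H_i)
  &= \sum_{\ell} c^i_\ell \big(p^i_\ell - E(x^i_\ell \mid H_i)\big) = 0, \\
\mathrm{Var}(Z^i \mid H_i)
  &= E\Big(\big(\textstyle\sum_{\ell} c^i_\ell x^i_\ell\big)^2 \,\Big|\, H_i\Big)
     - \Big(\sum_{\ell} c^i_\ell p^i_\ell\Big)^2 \\
  &= \sum_{\ell} (c^i_\ell)^2 p^i_\ell - \Big(\sum_{\ell} c^i_\ell p^i_\ell\Big)^2
     \qquad \text{(cross terms vanish since } x^i_\ell x^i_m = 0\text{)} \\
  &\le \sum_{\ell} c^i_\ell p^i_\ell
     \qquad \text{(since } (c^i_\ell)^2 \le c^i_\ell \text{ for } c^i_\ell \in [0,1]\text{)}.
\end{align*}
```

The first line is also what makes $Z^1, \dots, Z^T$ a martingale difference sequence.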



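On the other hand,

$$\begin{aligned}
\sum_{(i,\ell) \in S} c^i_\ell p^i_\ell
  &= Z + \sum_{(i,\ell) \in S} c^i_\ell x^i_\ell \\
  &= Z + \sum_{(i,\ell) \in S} p^i_\ell\, \Delta A^i_\ell \\
  &\le Z + \sum_{(i,\ell) \in S} g(A^{i-1}_\ell)\, \Delta A^i_\ell \qquad \text{by definition of } S.
\end{aligned}$$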



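Hence,

$$\begin{aligned}
\Pr\left(\sum_{(i,\ell) \in S} c^i_\ell p^i_\ell \ge \alpha + \beta\right)
  &\le \Pr(Z \ge \alpha) + \Pr\left(\sum_{(i,\ell) \in S} g(A^{i-1}_\ell)\, \Delta A^i_\ell \ge \beta\right) \\
  &\le \exp\left(-\frac{\alpha^2}{2\beta + 8\alpha/3}\right) + \Pr\left(\sum_{(i,\ell) \in S} g(A^{i-1}_\ell)\, \Delta A^i_\ell \ge \beta\right). \qquad (4)
\end{aligned}$$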



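We claim that when $\beta = 3\theta$, the last term equals zero. To see this, first note that the
inequality $g(A^i_\ell) \ge \tfrac{2}{3}\, g(A^{i-1}_\ell)$ always holds. Since, for each $1 \le \ell \le K$, $g(A^i_\ell)$ is always a decreasing
sequence in $i$, it follows that

$$\begin{aligned}
\sum_{\substack{(i,\ell) \in S \\ g(A^{i-1}_\ell) \ge \eta}} g(A^{i-1}_\ell)\, \Delta A^i_\ell
  &\le \frac{3}{2} \sum_{\substack{(i,\ell) \in S \\ g(A^{i-1}_\ell) \ge \eta}} g(A^i_\ell)\, \Delta A^i_\ell \\
  &\le \frac{3}{2} \sum_{\ell=1}^{K} \sum_{i=1}^{T} \frac{\Delta A^i_\ell}{K (1 + A^i_\ell/\theta)^{3/2}} \\
  &< \frac{3K}{2} \int_0^{A_{\mathrm{crit}}} \frac{1}{K (1 + x/\theta)^{3/2}}\, dx \qquad \text{(since the integrand is a decreasing function)} \\
  &\le \frac{3}{2} \int_0^{\infty} \frac{1}{(1 + x/\theta)^{3/2}}\, dx \\
  &= 3\theta \left[ \frac{-1}{(1 + x/\theta)^{1/2}} \right]_{x=0}^{\infty} \\
  &= 3\theta.
\end{aligned}$$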

This establishes our claim. Setting $\alpha = \theta$, and combining this with (3) and (4), allows us to conclude

$$\Pr\left(\sum_{i=1}^{T} E(\Delta R^i_j + \Delta\Phi^i_j + \Delta A^i_j \mid H_i) \ge 4\theta + 2\eta KT\right) \le \exp(-3\theta/26).$$
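As a check on the constant, the exponent can be worked out explicitly. The calculation below assumes the exponent in (4) has the form $\alpha^2/(2\beta + 8\alpha/3)$ and uses the choices $\alpha = \theta$, $\beta = 3\theta$ made in the proof.

```latex
\[
\frac{\alpha^2}{2\beta + \tfrac{8}{3}\alpha}
  = \frac{\theta^2}{6\theta + \tfrac{8}{3}\theta}
  = \frac{\theta^2}{\tfrac{26}{3}\theta}
  = \frac{3\theta}{26}.
\]
```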



□

8 An adaptive cost schedule: Proof of Theorem 1.2



In this section, we present a simple family of adaptive cost schedules. We also outline the proof of
Theorem 1.2, showing that no matter how the parameters of the Exp3 algorithm of Auer et al.
are set, the expected regret will be $\Omega(T^{2/3})$ for some cost schedule from this family.

Let $K = 2$. (Adding extra arms which always have cost 1 can only increase the regret.) Let
$A$ be any algorithm for the gambler. For every $\alpha \in [0, 1]$, let $V(A, \alpha)$ be the following adaptive
strategy for setting costs. Let $p$ denote the conditional probability that $A$ chooses arm 1 at time $t$,
given the history of the game on steps $1, \dots, t - 1$. Then $V(A, \alpha)$ sets the cost vector $c^t$ as follows:

$$c^t := \begin{cases} e_2 & \text{if } p < \alpha, \\ e_1 & \text{otherwise.} \end{cases}$$

The motivation behind this adaptive method of setting costs is to encourage algorithm A to 
always move its probability distribution towards (and perhaps beyond) a. Note that this works in 
the case of all the multiplicative weights-based algorithms discussed in this paper; whenever these 
algorithms see a cost of 0, they keep the weight for that arm fixed, but when they see a cost of 1, 
they decrease that arm's weight. 
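The cost rule is simple enough to state as code. The following is a minimal sketch, not part of the original analysis: the function name `adaptive_cost_vector` is hypothetical, arms are indexed so that position 0 holds the cost of arm 1 and position 1 the cost of arm 2, and the caller is assumed to supply `p`, the algorithm's current conditional probability of choosing arm 1.

```python
from typing import List


def adaptive_cost_vector(p: float, alpha: float) -> List[int]:
    """Cost vector [cost of arm 1, cost of arm 2] set by the schedule V(A, alpha).

    p is the conditional probability that the algorithm picks arm 1 this round;
    alpha is the target threshold in [0, 1]. (Hypothetical helper for illustration.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p < alpha:
        # e_2: only arm 2 is charged, which discourages arm 2 and pushes p upward.
        return [0, 1]
    # e_1: only arm 1 is charged, which discourages arm 1 and pushes p downward.
    return [1, 0]
```

Because the rule reads only `p` and `alpha`, the adversary needs to track the algorithm's current distribution but not its realized choices.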

Observation 8.1. When Exp3 is run against adaptive costs from $V(\mathrm{Exp3}(\gamma, \eta), \alpha)$ for infinitely
many steps, the sequence of probabilities $p^t$ is uniquely determined modulo consecutive repetitions
of a single value. Whenever $p^t < \alpha$, $p^{t+1} \ge p^t$, and whenever $p^t \ge \alpha$, $p^{t+1} \le p^t$.

Proof. At each time step, there are only two random possibilities: a 0 is observed, or a 1. When
a 0 is observed, the algorithm, and hence the cost sequence, behaves exactly the same as if the
round had not occurred. This results in a repetition. When a 1 is observed, the algorithm shifts $p$
towards (and perhaps beyond) $\alpha$, by an amount which is uniquely determined. □

We now argue that, for every setting of the parameters $\gamma, \eta$ of the Exp3 algorithm of Auer et
al. [1][2], there exists an $\alpha \in [0, 1]$ such that against the adaptive cost schedule $V(\mathrm{Exp3}(\gamma, \eta), \alpha)$,
the expected regret of $\mathrm{Exp3}(\gamma, \eta)$ is $\Omega(T^{2/3})$. (Note that $\gamma, \eta$ may depend on $T$.)

Proof sketch for Theorem 1.2. Suppose Exp3 is run against adaptive costs from $V(\mathrm{Exp3}(\gamma, \eta), \alpha)$.
Then, clearly,

• the loss of the algorithm equals the number of steps when a 1 is observed,

• the loss of arm 1 equals the number of steps when $p < \alpha$,

• and the loss of arm 2 equals the number of steps when $p \ge \alpha$.

We will focus on the case when $\gamma = \eta = \Theta(T^{-1/2})$, under which parameters Exp3 has expected
regret $O(\sqrt{T})$ against any non-adaptive adversary. Let $\alpha = 3\gamma$. In this case, it is not hard to see
that, with high probability, $p$ will lie in the interval $[2\gamma, 4\gamma]$ for almost all $T$ rounds of the game.

Moreover, $p$ will cross from greater than $\alpha$ to less than $\alpha$ approximately $\alpha T/2$ times, taking one
big step down, and approximately $1/\alpha$ small steps up (the exact number is determined, but will
not concern us). Let us look in more detail at what happens during each such "loop traversal."

After each downward crossing of $\alpha$, the algorithm has probability about $1 - \alpha$ to see a 1 at
each time step. Since about $1/\alpha$ upward moves must be made to cross $\alpha$ again, this implies that
the number of steps before the next upward crossing of $\alpha$ has expectation about $1/\alpha$ and variance
$O(1)$. After each upward crossing of $\alpha$, the algorithm has probability approximately $\alpha$ to see a 1 at
each time step. Since only one downward move is needed to cross $\alpha$ again, this implies that the
number of steps before the next downward crossing of $\alpha$ has expectation about $1/\alpha$ and variance
$O(1/\alpha^2)$.
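These waiting-time estimates can be made concrete under an idealized model. The sketch below is an illustration only and rests on assumptions not made in the text: exactly `round(1/alpha)` upward moves are needed, each step below $\alpha$ produces a move independently with probability exactly $1 - \alpha$, and each step above $\alpha$ produces the single downward move independently with probability exactly $\alpha$. The function name `loop_step_statistics` is hypothetical.

```python
def loop_step_statistics(alpha: float) -> dict:
    """Mean and variance of the two waiting times in one loop (idealized model)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    # Assumed number of small upward moves needed to climb back above alpha.
    moves_up = max(1, round(1.0 / alpha))
    # Assumed per-step chance of an upward move while p is below alpha.
    q_up = 1.0 - alpha

    # Sum of `moves_up` independent geometric waiting times, each with
    # success probability q_up (a negative binomial count of trials).
    up_mean = moves_up / q_up
    up_var = moves_up * (1.0 - q_up) / q_up ** 2

    # One geometric waiting time with success probability alpha.
    down_mean = 1.0 / alpha
    down_var = (1.0 - alpha) / alpha ** 2

    return {
        "up_mean": up_mean,
        "up_var": up_var,
        "down_mean": down_mean,
        "down_var": down_var,
    }
```

For small `alpha`, `up_var` simplifies to roughly `1 / (1 - alpha) ** 2`, which stays bounded, while `down_var` grows like `1 / alpha ** 2`; this is the asymmetry the argument exploits.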

These facts together imply that, in any single loop, the change in arm-specific regret $\Delta R_1$ is a
random variable with mean $O(1)$ and variance $O(1)$. On the other hand, $\Delta R_2$ is a random variable
with mean $O(1)$ and variance $\Theta(1/\alpha^2)$ (a shifted exponential distribution).

If instead of being played for a fixed number of time steps, the game were played for exactly
$\alpha T/2$ complete loops (and ignoring the contribution of the few steps before the first downward
crossing of $\alpha$), then the total arm-specific regrets would be the sum of $\alpha T/2$ independent trials of
these two respective random variables. In this case, the expectations $E(R_1)$ and $E(R_2)$ would each
be $O(\alpha T)$; however, the variances would be $\Theta(\alpha T)$ and $\Theta(T/\alpha)$, respectively.

More to the point, $R_1$ is almost always $O(\alpha T)$, but $R_2$ is $\Omega(\sqrt{T/\alpha})$ with constant probability.
To see the latter, note that the loss of arm 2 is less than or equal to $L$ if and only if, in $L$ coin flips
with probability $\alpha$ of heads, at least $\alpha T/2$ heads come up. The probability of this equals

$$\sum_{j \ge \alpha T/2} \binom{L}{j} \alpha^j (1 - \alpha)^{L - j}.$$



The desired bound can, with a little work, be inferred using Stirling's formula, or from basic 
properties of binomial distributions; we omit the details from this sketch. 
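For small instances, the binomial tail above can be evaluated exactly rather than estimated. The following sketch is illustrative only: the function name `prob_arm2_loss_at_most` is hypothetical, and rounding the threshold $\alpha T/2$ up to an integer is an assumption made so that the sum has a well-defined lower limit.

```python
from math import ceil, comb


def prob_arm2_loss_at_most(L: int, alpha: float, T: int) -> float:
    """Probability of at least alpha*T/2 heads in L flips with heads-probability alpha."""
    if L < 0 or T < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("need L >= 0, T >= 0 and alpha in [0, 1]")

    # Smallest integer number of heads that meets the threshold alpha*T/2.
    threshold = ceil(alpha * T / 2)

    # Upper tail of the Binomial(L, alpha) distribution; empty if threshold > L.
    return sum(
        comb(L, j) * alpha ** j * (1.0 - alpha) ** (L - j)
        for j in range(threshold, L + 1)
    )
```

Exact summation is practical only for moderate `L`; for asymptotic statements the text's appeal to Stirling's formula or standard binomial estimates is the appropriate tool.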

Now, since $R_1$ is almost always greater than $-C\alpha T \approx -C\sqrt{T}$ and $R_2$ is with constant probability
greater than $\sqrt{T/\alpha} \approx T^{3/4}$, it follows that $E(R) = \Omega(\sqrt{T/\alpha}) = \Omega(T^{3/4})$.

Analogous arguments can be made for all other values of $\gamma, \eta$, proving an $\Omega(T^{2/3})$ lower bound
in general. Note that if $\gamma > T^{-1/3}$, one should set $\alpha < \gamma$, in which case the argument is different
but easier, and the lower bound is $\gamma T$. □



9 Applications 

The k-armed bandit has been used as a model for a wide variety of online decision-making problems,
such as combining expert advice, portfolio balancing, machine learning (boosting), network routing, 
and sequential auctions (among others). In many of these contexts, it would be desirable to provide 
a high-probability guarantee on the actual regret, rather than the expected regret, and/or to relax 
the assumption that the decisions made by the algorithm have no effect on the distribution of 
subsequent incurred costs. The "Accounts" algorithm provides both of these features. 

We plan to add a more detailed description of some of these applications in a later version of 
this paper.